283 research outputs found

    Application-level differential checkpointing for HPC applications with dynamic datasets

    High-performance computing (HPC) requires resilience techniques such as checkpointing in order to tolerate failures in supercomputers. As the number of nodes and the amount of memory in supercomputers keep increasing, the size of checkpoint data also increases dramatically, sometimes causing an I/O bottleneck. Differential checkpointing (dCP) aims to minimize the checkpointing overhead by writing only data differences. This is typically implemented at the memory page level, sometimes complemented with hashing algorithms. However, such a technique is unable to cope with dynamic-size datasets. In this work, we present a novel dCP implementation with a new file format that allows fragmentation of protected datasets in order to support dynamic sizes. We identify dirty data blocks using hash algorithms. To evaluate the dCP performance, we ported the HPC applications xPic, LULESH 2.0 and Heat2D and analyzed their potential for reducing I/O with dCP and how this data reduction influences checkpoint performance. In our experiments, we achieve reductions of up to 62% of the checkpoint time.
    This project has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) and the Horizon 2020 (H2020) funding framework under grant agreement no. H2020-FETHPC-754304 (DEEP-EST), and from the European Union's Horizon 2020 research and innovation programme under the LEGaTO Project (legato-project.eu), grant agreement no. 780681.
    Peer Reviewed. Postprint (author's final draft).
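
    As an illustration of the dirty-block detection described above, the following sketch splits a protected buffer into fixed-size blocks, hashes each block and writes only the blocks whose hash changed since the previous checkpoint. The block size, the FNV-1a hash and the on-disk record layout are illustrative assumptions, not the file format or hash algorithms used in the paper.

```c
/* Minimal sketch of hash-based dirty-block detection for differential
 * checkpointing. BLOCK_SIZE, the FNV-1a hash and the checkpoint record
 * layout (offset, length, payload) are illustrative assumptions. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define BLOCK_SIZE 4096

/* FNV-1a hash over one block (stand-in for the paper's hash algorithms). */
static uint64_t fnv1a(const unsigned char *data, size_t len) {
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Write only the blocks whose hash changed since the last checkpoint.
 * Returns the number of blocks written. */
static size_t diff_checkpoint(FILE *ckpt, const unsigned char *data, size_t size,
                              uint64_t *prev_hashes, size_t nblocks) {
    size_t written = 0;
    for (size_t b = 0; b < nblocks; b++) {
        size_t off = b * BLOCK_SIZE;
        size_t len = (off + BLOCK_SIZE <= size) ? BLOCK_SIZE : size - off;
        uint64_t h = fnv1a(data + off, len);
        if (h != prev_hashes[b]) {          /* dirty block: write offset + payload */
            fwrite(&off, sizeof off, 1, ckpt);
            fwrite(&len, sizeof len, 1, ckpt);
            fwrite(data + off, 1, len, ckpt);
            prev_hashes[b] = h;
            written++;
        }
    }
    return written;
}

int main(void) {
    size_t size = 1 << 20;                          /* 1 MiB example dataset */
    size_t nblocks = (size + BLOCK_SIZE - 1) / BLOCK_SIZE;
    unsigned char *data = calloc(size, 1);
    uint64_t *hashes = calloc(nblocks, sizeof *hashes);

    FILE *ckpt = fopen("ckpt.diff", "wb");          /* hypothetical checkpoint file */
    if (!data || !hashes || !ckpt) return 1;

    printf("first checkpoint:  %zu blocks\n",
           diff_checkpoint(ckpt, data, size, hashes, nblocks));   /* full write */
    data[3 * BLOCK_SIZE + 7] = 42;                  /* dirty a single block */
    printf("second checkpoint: %zu blocks\n",
           diff_checkpoint(ckpt, data, size, hashes, nblocks));   /* only 1 block */

    fclose(ckpt);
    free(data);
    free(hashes);
    return 0;
}
```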

    Discovering the Ethereum2 P2P network

    Achieving the equilibrium between scalability, sustainability and security while preserving decentralization has prevailed as the target for decentralized blockchain applications over the last years. Several approaches have been proposed by multiple blockchain teams to achieve it, Ethereum being among them. Ethereum is on the path of a major protocol improvement called Ethereum 2.0 (Eth2), implementing sharding and introducing Proof-of-Stake (PoS). As the change of consensus mechanism is a delicate matter, this improvement will be achieved through different phases, the first of which is the implementation of the Beacon Chain. Like Ethereum 1.0, Eth2 relies on a decentralized peer-to-peer (p2p) network for message distribution. To date, we estimate that there are around 5,000 geographically distributed nodes in the Eth2 main net. However, the topology of this network remains unknown. In this paper, we present the results obtained from our analysis of the Eth2 p2p network, describing the topology of the network as well as the possible hazards it implies.
    This work has been supported by the Ethereum Foundation under Grant FY20-0198.
    Peer Reviewed. Postprint (author's final draft).

    Armiarma: Ethereum2 Network Monitoring Tool

    Achieving the equilibrium between scalability, sustainability and security has prevailed as the ideal for decentralized blockchain applications over the last years. Several approaches have been proposed, Ethereum being a solid proposal among them. Ethereum is on the path of a major protocol improvement called Ethereum 2.0 (Eth2), implementing sharding and introducing Proof-of-Stake (PoS). As the change of consensus mechanism is a delicate matter, this improvement will be achieved through different phases, the first of which is the implementation of the Beacon Chain. The implementation of the latter has started with the recent launch of the Eth2 main net. In this work, we introduce an Eth2 network monitoring tool, called Armiarma, used to generate a complete analysis of the p2p network of the Eth2 main net. In this paper, we present some of the results that this Eth2 network monitor can achieve.

    Performance Study of Non-volatile Memories on a High-End Supercomputer

    The first exascale supercomputers are expected to be operational in China, the USA, Japan and Europe in the early 2020s. This will allow scientists to execute applications at extreme scale, with more than 10^18 floating-point operations per second (exa-FLOPS). However, the number of FLOPS is not the only parameter that determines the final performance. In order to store intermediate results or to provide fault tolerance, most applications need to perform a considerable amount of I/O operations during runtime. The performance of those operations is determined by the throughput from volatile memory (e.g. DRAM) to non-volatile stable storage. Given the slow growth in network bandwidth compared to the computing capacity on the nodes, it is highly beneficial to deploy local stable storage, such as the new non-volatile memories (NVMe), in order to avoid the transfer through the network to the parallel file system. In this work, we analyse the performance of three different storage levels of the CTE-POWER9 cluster, located at the Barcelona Supercomputing Center (BSC). We compare the throughput of the SSD and NVMe devices on the nodes to that of the GPFS under various scenarios and settings. We measured a maximum performance on 16 nodes of 83 GB/s using NVMe devices, 5.6 GB/s for SSD devices and 4.4 GB/s for writes to the GPFS.
    This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 708566 (DURO). Part of the research presented here has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) and the Horizon 2020 (H2020) funding framework under grant agreement no. H2020-FETHPC-754304 (DEEP-EST). The present publication reflects only the authors' views. The European Commission is not liable for any use that might be made of the information contained therein.
    Peer Reviewed. Postprint (author's final draft).
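
    To give a sense of how such write-throughput numbers are obtained, the sketch below times a streaming write of a fixed amount of data to one storage target and reports the resulting bandwidth. The mount point, transfer sizes and single-process setup are illustrative assumptions; the multi-node measurements in the paper are not reproduced by this snippet.

```c
/* Minimal single-node write-throughput sketch. The target path, chunk size
 * and total volume are placeholders; a real benchmark would also drop caches,
 * repeat runs and aggregate results across MPI ranks and nodes. */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(void) {
    const char *path = "/nvme/bench.dat";       /* hypothetical local NVMe mount */
    const size_t chunk = 16UL << 20;            /* 16 MiB per write */
    const size_t total = 1UL << 30;             /* 1 GiB in total */

    char *buf = malloc(chunk);
    if (!buf) return 1;
    memset(buf, 0xAB, chunk);

    FILE *f = fopen(path, "wb");
    if (!f) { perror("fopen"); return 1; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t done = 0; done < total; done += chunk)
        fwrite(buf, 1, chunk, f);
    fflush(f);
    fsync(fileno(f));                           /* make sure data reaches the device */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    fclose(f);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("wrote %.1f GiB in %.2f s -> %.2f GB/s\n",
           total / (double)(1UL << 30), secs, total / secs / 1e9);
    free(buf);
    return 0;
}
```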

    Resource analysis of Ethereum 2.0 clients

    Scalability is a common issue among the most used permissionless blockchains, and several approaches have been proposed to solve this issue. Tackling scalability while preserving the security and decentralization of the network is an important challenge. To deliver effective scaling solutions, Ethereum is on the path of a major protocol improvement called Ethereum 2.0 (Eth2), which implements sharding. As the change of consensus mechanism is an extremely delicate matter, this improvement will be achieved through different phases, the first of which is the implementation of the Beacon Chain. For this, a specification has been developed, and multiple groups have implemented clients to run the new protocol. This work analyzes the resource usage behavior of different clients running as Eth2 nodes, comparing their performance and analyzing differences. Our results show multiple important network perturbations and how different clients react to them. We discuss the differences between Eth2 clients and their limitations.
    This work has been supported by the Ethereum Foundation under Grant FY20-0198.
    Peer Reviewed. Postprint (author's final draft).
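
    As a rough illustration of this kind of resource monitoring, the sketch below periodically samples the resident memory of a running client process from /proc on Linux. The pid argument, the five-second interval and the single RSS metric are illustrative assumptions; they do not reflect the full metric set or the tooling used in the study.

```c
/* Minimal sketch: poll /proc/<pid>/statm to track the resident set size of a
 * running Eth2 client process. Interval and metric choice are illustrative. */
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc != 2) { fprintf(stderr, "usage: %s <pid>\n", argv[0]); return 1; }
    long page_kb = sysconf(_SC_PAGESIZE) / 1024;    /* pages -> KiB */
    char path[64];
    snprintf(path, sizeof path, "/proc/%s/statm", argv[1]);

    for (;;) {
        FILE *f = fopen(path, "r");
        if (!f) { perror("fopen"); return 1; }      /* process likely exited */
        long size, resident;
        if (fscanf(f, "%ld %ld", &size, &resident) == 2)
            printf("rss = %ld MiB\n", resident * page_kb / 1024);
        fclose(f);
        sleep(5);                                   /* sample every 5 seconds */
    }
}
```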

    Resource Analysis of Ethereum 2.0 Clients

    Scalability is a common issue among the most used permissionless blockchains, and several approaches have been proposed accordingly. As Ethereum is set to be a solid foundation for a decentralized Internet, the need to tackle scalability issues while preserving the security of the network is an important challenge. In order to successfully deliver effective scaling solutions, Ethereum is on the path of a major protocol improvement called Ethereum 2.0 (Eth2), which implements sharding. As the change of consensus mechanism is an extremely delicate matter, this improvement will be achieved through different phases, the first of which is the implementation of the Beacon Chain. For this, a specification has been developed and multiple groups have implemented clients to run the new protocol. In this work, we analyse the resource usage behaviour of different clients running as Eth2 nodes, comparing their performance and analysing their differences. Our results show multiple network perturbations and how different clients react to them.

    Towards Ad Hoc Recovery for Soft Errors

    The coming exascale era is a great opportunity for high-performance computing (HPC) applications. However, high failure rates on these systems will jeopardize the successful completion of their execution. Bit-flip errors in dynamic random access memory (DRAM) account for a noticeable share of the failures in supercomputers. Hardware mechanisms, such as error correcting codes (ECC), can detect and correct single-bit errors and can detect some multi-bit errors, while others may go undiscovered. Unfortunately, detected multi-bit errors will most of the time force the termination of the application and lead to a global restart. Thus, other strategies at the software level are needed to tolerate these types of faults more efficiently and to avoid a global restart. In this work, we extend the FTI checkpointing library to facilitate the implementation of custom recovery strategies for MPI applications, minimizing the overhead introduced when coping with soft errors. The new functionalities are evaluated by implementing local forward recovery on three HPC benchmarks with different reliability requirements. Our results demonstrate a reduction in recovery times of up to 14%.
    This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 708566 (DURO). This research is also supported by the Ministry of Economy and Competitiveness of Spain and FEDER funds of the EU (project TIN2016-75845-P and the predoctoral grant of Nuria Losada, ref. BES-2014-068066), and by the Galician Government (Xunta de Galicia) under the Consolidation Program of Competitive Research (ref. ED431C 2017/04).
    Peer Reviewed. Postprint (author's final draft).
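
    For context, the sketch below shows the baseline FTI protect/checkpoint/recover pattern that such an extension builds on; the custom ad hoc recovery interface added in the paper is not shown. The configuration file name, dataset sizes and checkpointing frequency are illustrative assumptions.

```c
/* Minimal sketch of the standard FTI checkpoint/restart pattern for an MPI
 * application. "config.fti", the dataset size and the checkpoint interval are
 * placeholders; the paper's custom recovery extensions are not used here. */
#include <fti.h>
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    FTI_Init("config.fti", MPI_COMM_WORLD);     /* FTI provides FTI_COMM_WORLD */

    long n = 1 << 20;
    double *field = calloc(n, sizeof *field);
    int step = 0;

    /* Register the data that must survive a failure. */
    FTI_Protect(0, &step, 1, FTI_INTG);
    FTI_Protect(1, field, n, FTI_DBLE);

    /* After a restart, restore the protected datasets from the last checkpoint. */
    if (FTI_Status() != 0)
        FTI_Recover();

    for (; step < 1000; step++) {
        /* ... computation on field, using FTI_COMM_WORLD for MPI calls ... */
        if (step % 100 == 0)
            FTI_Checkpoint(step / 100 + 1, 1);  /* level-1 (local) checkpoint */
    }

    free(field);
    FTI_Finalize();
    MPI_Finalize();
    return 0;
}
```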

    FPGA checkpointing for scientific computing

    The use of FPGAs in computational workloads is becoming increasingly popular due to the flexibility of these devices in comparison to ASICs and their low power consumption compared to GPUs and CPUs. However, scientific applications run for long periods of time, and the hardware is always subject to failures due to either soft or hard errors. Thus, it is important to protect these long-running jobs with fault tolerance mechanisms. Checkpoint-Restart is a popular technique in high-performance computing that allows large-scale applications to cope with frequent failures. In this work, we approach the fault tolerance of CPU-FPGA heterogeneous applications from a high level by using the OmpSs@FPGA environment and a multi-level checkpointing library. We analyse the performance of several different applications and assess the overheads to be expected from checkpointing computational workloads running on FPGAs. Our results show overheads as low as 0.16% and 0.66% when checkpointing very frequently, indicating that this technique is efficient and does not add a significant amount of overhead to the system. In addition, we showcase a proof of concept for checkpointing partial data of the FPGA task itself. This can prove useful for workloads in which most data is offloaded to the FPGA memory at once and is not constantly moved between the accelerator and the CPU.
    This research has received funding from the European Union's Horizon 2020 research and innovation programme under projects EuroEXA (grant agreement nº 754337) and eProcessor (grant agreement nº 956702).
    Peer Reviewed. Postprint (author's final draft).